hw1
descriptive statistics
probability
The first homework on descriptive statistics and probability
Author

Yakub Rabiutheen

Published

September 20, 2022

Question 1

a

The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.2
Code
##load data
data(UN11)

Qn 1.1.1

The predictor is ppgdp and the response is fertility.

Code
# Qn 1.1.1 Standard Scatterplot
library(ggplot2)
ggplot(data = UN11, aes(x=ppgdp,y=fertility)) + geom_point()

Qn 1.1.2

Code
# Log scatterplot.
scatterplot(log(fertility) ~ log(ppgdp), data = UN11)

Question 2.a

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

Code
usdollar<- (1:10)
pound<- seq(1.33,13.3, length.out = 10)

slope<-(usdollar/pound)
slope
 [1] 0.7518797 0.7518797 0.7518797 0.7518797 0.7518797 0.7518797 0.7518797
 [8] 0.7518797 0.7518797 0.7518797

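To convert from USD to GBP, each income value is divided by 1.33; the ratio computed above, 1/1.33 ≈ 0.752, is the number of pounds per dollar. Because income is the explanatory variable, dividing it by 1.33 multiplies the slope by 1.33, since the slope is now measured per pound rather than per dollar. If the response variable were the one converted instead, the slope would be divided by 1.33.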
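A small simulation can make the effect of the unit change concrete. The sketch below uses made-up data (the names `income_usd`, `income_gbp`, and `y` are hypothetical and not part of the homework) and refits the regression after converting first the explanatory variable and then the response, assuming 1 pound is about 1.33 dollars.

```r
# Simulated data for illustration only; nothing here comes from the homework.
set.seed(1)
income_usd <- runif(50, min = 20000, max = 100000)  # explanatory variable in dollars
y <- 2 + 0.0001 * income_usd + rnorm(50)            # arbitrary response

# Convert the explanatory variable to pounds.
income_gbp <- income_usd / 1.33

slope_usd <- unname(coef(lm(y ~ income_usd))[2])
slope_gbp <- unname(coef(lm(y ~ income_gbp))[2])

# Rescaling x by 1/1.33 multiplies the slope by 1.33.
ratio_x <- slope_gbp / slope_usd

# Rescaling the response instead divides the slope by 1.33.
slope_y_gbp <- unname(coef(lm(I(y / 1.33) ~ income_usd))[2])
ratio_y <- slope_y_gbp / slope_usd

c(ratio_x = ratio_x, ratio_y = ratio_y)
```

The two ratios depend only on the conversion factor, not on the simulated values, so the same pattern should hold for any data set.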

Question 2.b

How, if at all, does the correlation change?

Code
cor.test(usdollar,pound)

    Pearson's product-moment correlation

data:  usdollar and pound
t = 189812531, df = 8, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 1 1
sample estimates:
cor 
  1 

Currency changes do not affect the correlation: it is unit-free, so rescaling a variable by a positive constant leaves it unchanged.
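One way to check this directly is to correlate a response with income measured in each currency. The following is a minimal sketch on simulated data (the variables `income_usd` and `y` are hypothetical), again assuming a rate of 1.33 dollars per pound.

```r
# Simulated data for illustration only.
set.seed(2)
income_usd <- runif(50, min = 20000, max = 100000)
y <- 2 + 0.0001 * income_usd + rnorm(50)

r_usd <- cor(income_usd, y)         # income in dollars
r_gbp <- cor(income_usd / 1.33, y)  # the same income in pounds

# The two correlations agree up to floating-point error.
all.equal(r_usd, r_gbp)
```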

Question 3

Water runoff in the Sierras (Data file: water in alr4) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots. (Hint: Use the pairs() function.)

Code
# The data set in alr4 is named `water`; `water_supply` was never defined.
pairs(water, main = "Sierra Southern California Water Supply Runoff",
      pch = 21, bg = "green")
Code
lm_water_supply<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE,data = water)
summary(lm_water_supply)

Call:
lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
    OPSLAKE, data = water)

Residuals:
   Min     1Q Median     3Q    Max 
-12690  -4936  -1424   4173  18542 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15944.67    4099.80   3.889 0.000416 ***
APMAM         -12.77     708.89  -0.018 0.985725    
APSAB        -664.41    1522.89  -0.436 0.665237    
APSLAKE      2270.68    1341.29   1.693 0.099112 .  
OPBPC          69.70     461.69   0.151 0.880839    
OPRC         1916.45     641.36   2.988 0.005031 ** 
OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared:  0.9248,    Adjusted R-squared:  0.9123 
F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

The variables OPBPC, OPRC, and OPSLAKE are correlated with one another.
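The visual impression from a scatterplot matrix can be backed up with the pairwise correlations themselves. The sketch below defines a hypothetical helper, `numeric_cor()` (not part of alr4 or the original homework), and applies it to `water` only if the alr4 package is installed.

```r
# Hypothetical helper: pairwise Pearson correlations of the numeric columns,
# rounded for readability.
numeric_cor <- function(df, digits = 2) {
  num <- df[vapply(df, is.numeric, logical(1))]
  round(cor(num, use = "pairwise.complete.obs"), digits)
}

# Assumes alr4 (which provides `water`) is available; skipped otherwise.
if (requireNamespace("alr4", quietly = TRUE)) {
  data(water, package = "alr4")
  # Drop the Year column so only the six sites and BSAAM remain.
  print(numeric_cor(water[, names(water) != "Year"]))
}
```

Entries close to 1 or -1 in the printed matrix would correspond to the tight linear bands in the plot.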

Question 4

Code
data("Rateprof")
# Scatterplot matrix of the five rating variables.
pairs(~ quality + helpfulness + clarity + easiness + raterInterest,
      data = Rateprof, lwd = 2,
      labels = c("QUALITY", "HELPFULNESS", "CLARITY", "EASINESS", "RATER INTEREST"),
      pch = 19, cex = 0.75, col = "blue")

The variables “quality”, “clarity”, and “helpfulness” are correlated with one another.

Question 5

Code
##load data
data(student.survey)
pi_conv <- as.numeric(student.survey$pi)
re_conv <- as.numeric(student.survey$re)
##run regression analysis
model1 <- lm(pi_conv ~ re_conv, data = student.survey)
summary(model1)

Call:
lm(formula = pi_conv ~ re_conv, data = student.survey)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81243 -0.87160  0.09882  1.12840  3.09882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9308     0.4252   2.189   0.0327 *  
re_conv       0.9704     0.1792   5.416 1.22e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.345 on 58 degrees of freedom
Multiple R-squared:  0.3359,    Adjusted R-squared:  0.3244 
F-statistic: 29.34 on 1 and 58 DF,  p-value: 1.221e-06
Code
##run regression analysis
model2 <- lm(hi ~ tv, data = student.survey)
summary(model2)

Call:
lm(formula = hi ~ tv, data = student.survey)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2583 -0.2456  0.0417  0.3368  0.7051 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.441353   0.085345  40.323   <2e-16 ***
tv          -0.018305   0.008658  -2.114   0.0388 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 58 degrees of freedom
Multiple R-squared:  0.07156,   Adjusted R-squared:  0.05555 
F-statistic: 4.471 on 1 and 58 DF,  p-value: 0.03879
Code
library(smss)
data("student.survey")
ggplot(data=student.survey,aes(x=re,fill=pi))+
  geom_bar() + labs(x="Religiosity", fill ="Political Ideology")

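As the graph and the first regression show, there is a statistically significant positive association between religiosity and political ideology (slope 0.97, p < 0.001, R-squared 0.34): students who attend religious services more often tend to be more conservative.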

Code
data("student.survey")
ggplot(data=student.survey,aes(x=hi, y=tv)) +
  geom_point() + labs(x="High School GPA", y="Hours Watching TV")  

There is only a weak negative relationship between hours spent watching TV and high school GPA (slope -0.018, p = 0.039, R-squared 0.07).

Code
summary(student.survey[,c('pi', 're', 'hi', 'tv')])
                     pi                re           hi              tv        
 very liberal         : 8   never       :15   Min.   :2.000   Min.   : 0.000  
 liberal              :24   occasionally:29   1st Qu.:3.000   1st Qu.: 3.000  
 slightly liberal     : 6   most weeks  : 7   Median :3.350   Median : 6.000  
 moderate             :10   every week  : 9   Mean   :3.308   Mean   : 7.267  
 slightly conservative: 6                     3rd Qu.:3.625   3rd Qu.:10.000  
 conservative         : 4                     Max.   :4.000   Max.   :37.000  
 very conservative    : 2